Advanced Text Analysis

SICSS-Munich, Day 4


Session 4️⃣: Embeddings

Valerie Hase (LMU Munich)

github.com/valeriehase

valerie-hase.com

Agenda

  • Words as vectors
  • Introduction to (word) embeddings
  • When and how to use embeddings?
  • Technical set-up & researcher decisions
  • Promises & Pitfalls
  • The road ahead

Words as Vectors

  • Transfer the idea of vector space models to words: Can we represent words in an \(N\)-dimensional vector space?

  • Preferably, in a way that…

    • is computationally more efficient (i.e., not using every unique word as a dimension)
    • better captures semantic meaning (i.e., represents similarities between similar words)
  • Idea 💡: We can represent each word as a function of the words around it, i.e., its context.

Words as Vectors in R

  • A simple example: We describe each word through the words before and after it (e.g., through a window of size 2).

  • To do so, we rely on a feature co-occurrence matrix, created via fcm(), indicating co-occurrences of words within a window of 2.

  • Let’s stick to our three exemplary sentences from session 3️⃣

Code
library("quanteda")
library("tidyverse")
fcm <- c("I like fruit like apple and kiwi",
         "I like kiwi, only kiwi",
         "I only like vegetables like potato") %>%
  tokens() %>%
  fcm(context = "window", window = 2) %>%
  convert(to = "data.frame")

head(fcm)
  doc_id I like fruit apple and kiwi , only vegetables potato
1      I 0    3     1     0   0    1 0    1          0      0
2   like 0    4     2     1   1    1 1    1          2      1
3  fruit 0    0     0     1   0    0 0    0          0      0
4  apple 0    0     0     0   1    1 0    0          0      0
5    and 0    0     0     0   0    1 0    0          0      0
6   kiwi 0    0     0     0   0    0 2    2          0      0

Words as Vectors

  • Let’s take the words apple and potato and compare them across the dimensions fruit (the \(x\)-axis) and vegetables (the \(y\)-axis) of our vector space

  • In short, we describe apple and potato via their co-occurrence with other words
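A minimal sketch of this comparison in R, reusing the three toy sentences from above. Note that fcm() stores each co-occurrence pair only once (upper triangle), so we symmetrize before extracting rows:

```r
library("quanteda")
library("tidyverse")

#Re-create the co-occurrence matrix (window of 2) for the toy sentences
co <- c("I like fruit like apple and kiwi",
        "I like kiwi, only kiwi",
        "I only like vegetables like potato") %>%
  tokens() %>%
  fcm(context = "window", window = 2) %>%
  as.matrix()

#Symmetrize so that every row contains the full co-occurrence counts
co <- co + t(co)

#Describe apple and potato via the dimensions fruit and vegetables
co[c("apple", "potato"), c("fruit", "vegetables")]
```

apple co-occurs with fruit (but not vegetables), potato with vegetables (but not fruit) — exactly the two axes used in the plot.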

Words as Vectors

We can extend the number of words we include and the number of dimensions along which we map them:

Welcome to the world of word embeddings! ✨

Image of an N-dimensional vector space

Going beyond bag-of-words

  • Identifying meaning through ngrams (Session 2️⃣)
  • Identifying meaning through syntax (Session 2️⃣)
  • Identifying meaning through semantic spaces (Session 3️⃣, 4️⃣)

Introduction to Word Embeddings

Word Embeddings…

➡️ Basically: Explain the meaning of a word via the words around it!

➡️ Embeddings also work with other content (e.g., images), but we ignore this for now

Introduction to Word Embeddings

  • Embeddings are learned from data as a form of self-supervised learning: The meaning of a word is learned from its context

  • Let’s try an example

Introduction to Word Embeddings

  • Embeddings are dense, short vectors representing words in an \(N\)-dimensional space, with these dimensions encoding meaning

    • dense because they rely on continuous values [0.3, -0.1, 1.9] instead of one-hot encoding [0, 1, 0]
    • short because the number of dimensions \(N\) no longer equals \(|V|\) (the vocabulary size) but is reduced to a smaller set of dimensions (often a few hundred)
  • Let’s check a nicer visualization

  • Idea somewhat similar to other dimension-reducing approaches, e.g. PCA
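To make the dense vs. one-hot contrast concrete, a toy base-R sketch (the dense values are made up purely for illustration):

```r
#Toy vocabulary of size |V| = 3 (real corpora: tens of thousands of words)
vocab <- c("apple", "kiwi", "potato")

#One-hot encoding: one dimension per vocabulary entry, a single 1
one_hot_kiwi <- as.numeric(vocab == "kiwi")
one_hot_kiwi #0 1 0

#Dense embedding: few continuous dimensions (values made up),
#independent of vocabulary size
dense_kiwi <- c(0.3, -0.1, 1.9)
```

The one-hot vector grows with the vocabulary, while the dense vector stays at a fixed, small number of dimensions.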

How could we use these methods for social science questions? 🤔

When and how to use embeddings?

According to Rodriguez & Spirling (2022):

  • instrumental use: as features in downstream analysis, for expanding dictionaries

  • direct measures of meaning: as central object of study

Instrumental Use: Sentiment Analysis

Use embeddings as features for downstream, supervised ML-based sentiment analysis of plenary speeches from Austria (Rudkowsky et al., 2018):

Figure 2 from Rudkowsky et al., 2018

Note. Figure from Rudkowsky et al. (2018, p. 144).

Instrumental Use: Sentiment Analysis

Use embeddings as features for downstream, supervised ML-based sentiment analysis of plenary speeches from Austria (Rudkowsky et al., 2018):

Table 2 and 3 from Rudkowsky et al., 2018

Note. Figure from Rudkowsky et al. (2018, p. 145).

Direct measures of meaning: Stereotypes

Use embeddings as direct measures of stereotypes in society partly based on Google News (Garg et al., 2018; see also Kroon et al., 2019; Müller et al., 2023):

Figure 1 from Garg et al., 2018

Note. Figure from Garg et al. (2018, p. E3636).

Direct measures of meaning: Stereotypes

Use embeddings as direct measures of stereotypes in society partly based on Google News (Garg et al., 2018; see also Kroon et al., 2019; Müller et al., 2023):

Table 1 from Garg et al., 2018

Note. Figure from Garg et al. (2018, p. E3638).

Direct measures of meaning: Shifts over time

Use embeddings as direct measures of how meaning changes in societies partly based on Google Ngram corpus (mostly books) (Kozlowski et al., 2019; see also Hamilton et al., 2016; Rodman, 2020):

Figure 10 from Kozlowski et al., 2019

Note. Figure from Kozlowski et al. (2019, p. 928).

Technical set-up & researcher decisions

  1. Estimate embeddings
  2. Aggregate, for instance from word to document level (Le & Mikolov, 2014)
  3. Potentially visualize/cluster results
  4. Evaluate against quality criteria (validity, reliability)

Main decisions for step 1: Estimation

According to Rodriguez & Spirling (2022), …

  1. Choice of window size
  2. Choice of embedding dimensions
  3. Pretrained or locally trained model
  4. Type of model

Main decisions for step 1: Estimation

  1. Choice of window size (see also Jurafsky & Martin, 2023)

    • Small window: syntactically similar

    • Large window: topically similar

✅ Advice by Rodriguez & Spirling (2022): avoid small windows (<5)

Main decisions for step 1: Estimation

  2. Choice of embedding dimensions, which differ based on the model

    • Few dimensions: may not capture meaning

    • Many dimensions: may model noise

✅ Advice by Rodriguez & Spirling (2022): avoid few dimensions (<100)

What do embedding dimensions stand for? 🤔

Main decisions for step 1: Estimation

  3. Pretrained or locally trained model

    • Pretrained: cheap

    • Locally trained: better if corpus very different from training corpus

✅ Advice by Rodriguez & Spirling (2022): You can use pretrained models (unlike, e.g., off-the-shelf dictionaries, which you should avoid), unless you expect domain-specific use of language

Main decisions for step 1: Estimation

  4. Type of model

Exemplary Model: Word2Vec

  • Google’s Word2Vec (Le & Mikolov, 2014; Mikolov et al., 2013) is a predictive model: it treats embedding estimation as a binary prediction task (see further Jurafsky & Martin, 2023):

    • Is word \(w_i\) likely to appear next to \(w_{i-1}\) and \(w_{i+1}\)? (local context, here window of 1)

    • Use trained classifier weights as word embeddings: self-supervision

  • Two algorithms:

    • Skip-gram (given \(w_i\), predict context)

    • Continuous-bag-of-words (given context, predict \(w_i\))

  • Advantageous over sparse VSM, but: inefficient for new words, static embeddings (i.e., one vector per word, regardless of context)
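A base-R sketch of the self-supervision idea: from running text, skip-gram derives (target, context) training pairs — here with a window of 1. This is a toy illustration of where the labels come from, not actual Word2Vec code:

```r
#Toy sentence
toks <- c("I", "like", "kiwi")

#Skip-gram training pairs with a window of 1:
#for each target word, predict its immediate neighbours
pairs <- do.call(rbind, lapply(seq_along(toks), function(i) {
  context <- toks[c(i - 1, i + 1)]     #neighbouring positions
  context <- context[!is.na(context)]  #drop out-of-range positions
  data.frame(target = toks[i], context = context)
}))

pairs
#  target context
#1      I    like
#2   like       I
#3   like    kiwi
#4   kiwi    like
```

The classifier is then trained to distinguish such observed pairs from randomly sampled ("negative") pairs; its learned weights become the embeddings.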

Exemplary Model: GloVe

  • Stanford’s GloVe (Pennington et al., 2014) is count-based: relies on co-occurrence matrix & focuses on ratio of co-occurrence probabilities:

    • Can we explain word \(w_i\) via its co-occurrence with all other words? (global context)
  • Slightly more robust than Word2Vec (Rodriguez & Spirling, 2022), but: inefficient for new words, static embedding

Further extensions: fastText & À la Carte

  • Facebook’s fastText (Bojanowski et al., 2017):
    • Uses more fine-grained subword information (character n-grams): we can locate new words in the vector space
  • À la Carte embedding (Khodak et al., 2018):
    • Takes pretrained embeddings (e.g., Word2Vec) & combines them with example uses of a word to create context-specific embeddings

Main decisions for step 2: Aggregation

  • After estimating word embeddings (Step 1): How can we get aggregate measures for each document?

  • Take mean of word vectors (see for example Rudkowsky et al., 2018)

  • Use doc2vec to create document-specific embeddings (Le & Mikolov, 2014)
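The first option (mean of word vectors) in a toy sketch — word_vectors_toy is a made-up 3-dimensional embedding matrix, used only for illustration:

```r
#Made-up 3-dimensional word embeddings
word_vectors_toy <- rbind(
  like  = c(0.2,  0.1, -0.3),
  kiwi  = c(0.9, -0.4,  0.5),
  apple = c(0.8, -0.3,  0.6)
)

#Document "like kiwi": average the vectors of its tokens
doc_tokens <- c("like", "kiwi")
doc_embedding <- colMeans(word_vectors_toy[doc_tokens, , drop = FALSE])

doc_embedding #0.55 -0.15 0.10
```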

Main decisions for step 3: Visualizing/Clustering

  • After creating document embeddings (Step 2): How can we visualize/cluster/understand results?
    • Dimension-reduction technique like UMAP to reduce to 2-dimensional space
    • Often in addition to cluster analysis
  • We can plot results using common R packages like ggplot

Main decisions for step 4: Evaluation

Second dataset for today

  • We’ll use data provided by the quanteda.corpora package (install directly via Github using devtools)
  • UK news coverage by the Guardian on immigration
  • Corpus contains N = 6,000 articles
Code
library("quanteda.corpora")
corpus_news <- download("data_corpus_guardian")

Step 1: Estimation in R

  • We will now use the GloVe model.
  • For the sake of explanation, this is going to be a highly simplistic model: the goal is to roughly understand, not to perfect, how to estimate embeddings in R (much could be improved: preprocessing, estimation, etc.)
  • First, we create a co-occurrence matrix:
Code
library("dplyr")
library("quanteda")
library("text2vec")
library("irlba")
library("purrr")
library("ggplot2")

dfm <- corpus_news %>%
  tokens() %>%
  fcm(context = "window",
      count = "weighted")

Step 1: Estimation in R

  • Next, we estimate embeddings using the text2vec package (which itself draws on the rsparse package, more here)
  • This code partly draws on similar tutorials, e.g., here, here, and here
Code
#Create GloVe model object
glove <- GloVe$new(rank = 50, #desired dimensions
                   x_max = 10) #maximum number of co-occurrences used for weighting

#Fit model and return embeddings
embeddings <- glove$fit_transform(dfm, #input dfm
                                  n_iter = 10, #number of iterations
                                  n_threads = 8) #number of threads to use

#Model returns two sets of vectors (main + context) - we sum them, as in the GloVe paper
word_vectors <- embeddings + t(glove$components)

Step 1: Estimation in R

  • Let’s check results: Let’s have a look at the first five dimensions of the embeddings for the first five features
Code
word_vectors[1:5, 1:5]
                 [,1]        [,2]        [,3]       [,4]        [,5]
London      0.5839257 -0.54383775  0.03708588 -1.0679305  0.78105767
masterclass 0.5033765  0.13811949 -0.58227950  0.3053519  0.56911472
on          0.6319902 -0.74350308  0.42888036  0.1069440 -0.30850540
climate     0.1485323 -0.37760596  1.11649407  0.3183808  0.30814974
change      0.1557727 -0.07244418  0.75674176  0.1011174 -0.04583707

Step 1: Estimation in R

  • Ok, that’s not telling us much.
  • Can we use the embeddings to see which features are used in similar context?
  • Knowing that this is a corpus on immigration, we may be interested in stereotypical associations of the word terror (e.g., with immigration-related terms). Is there indication of such stereotypical reporting?
Code
feature <- word_vectors["terror", , drop = FALSE] 

#find similar features in corpus
cos_sim <- sim2(x = word_vectors, 
                y = feature, 
                method = "cosine", 
                norm = "l2")

#Focus on five most similar features
head(sort(cos_sim[,1], decreasing = TRUE), 5)
   terror terrorist   attacks    attack terrorism 
1.0000000 0.8517235 0.7734465 0.6652497 0.6601173 
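As a sanity check on what sim2() computes here, cosine similarity can also be written out by hand (toy vectors, not the actual embeddings):

```r
#Cosine similarity: dot product divided by the product of the vector norms
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(c(1, 2, 3), c(1, 2, 3)) #identical direction: 1
cosine(c(1, 0, 0), c(0, 1, 0)) #orthogonal vectors: 0
```

This also explains why terror has a similarity of exactly 1 with itself in the output above.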

Step 3: Visualizing/Clustering in R

  • More broadly: What other features are associated with the term terror? Can we visualize results?
  • To do so, we first have to reduce our results to 2 (plottable) dimensions, here via the irlba package
Code
word_vectors_reduced <- irlba(word_vectors, 2) %>% #dimension reduction
  pluck("u") %>% #get relevant object from list
  as_tibble() %>%
  mutate(feature = rownames(word_vectors)) #add feature names as variable

#check results
head(word_vectors_reduced)
# A tibble: 6 × 3
         V1        V2 feature    
      <dbl>     <dbl> <chr>      
1  0.0146    0.00904  London     
2 -0.000435 -0.00103  masterclass
3  0.0254    0.00172  on         
4  0.0123    0.00110  climate    
5  0.0149    0.000456 change     
6  0.0213    0.0224   |          

Step 3: Visualizing/Clustering in R

  • Let’s grab some specific features which may (or may not) indicate stereotypical reporting!
  • We throw in a couple of (presumably) unrelated words to understand differences
  • We use ggplot to visualize results
Code
word_vectors_reduced %>%
  filter(feature %in% c("immigration", "migration",
                        "refugee", "islam", 
                        "terrorist", "terror",
                        "Paris", "Berlin")) %>% #only works for features existing in corpus
  ggplot(aes(x = V1, y = V2, label = feature)) +
  geom_text(hjust = 0, vjust = 0, color = "black") +
  theme_classic() +
  xlab("Dimension 1") +
  ylab("Dimension 2")

Step 3: Visualizing/Clustering in R

Plot of the selected features in the two-dimensional embedding space

Summary: Application in R

  • Again: this is a much simplified example of how to estimate & visualize embeddings in R
  • Better alternatives include the conText package
  • And of course: many more Python libraries

Summary: Comparison to “traditional” methods

                      Classic Supervised        Classic Unsupervised        Embeddings
Methods               e.g., SVM                 e.g., topic modeling        e.g., Word2Vec, GloVe
Bag-of-words          Yes                       Yes                         No
Input                 DFM, labelled \(y\)       DFM                         FCM
Output                Term importance matrix    Document distr. over        Word vectors
                      (for class prediction)    topics; topic distr.
                                                over words
Researcher decision   e.g., training/test       e.g., number of topics      e.g., window size,
                      split                     \(K\)                       dimensions, model
Validation concerns   Fixed procedures          Lack of procedures          Lack of procedures
Robustness concerns   Fixed procedures          Lack of procedures          Lack of procedures

Note. Table extended based on Rodriguez & Spirling (2022, p. 104)

Promises & Pitfalls

(Arseniev-Koehler, 2022; Blodgett et al., 2020; Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Grimmer et al., 2022; Jurafsky & Martin, 2023; Rodriguez et al., 2023; Schnabel et al., 2015):

  • ✅ encode semantic similarity: more realistic

  • ✅ more computationally efficient, at least compared to traditional sparse VSM

  • ✅ can (partly) handle unseen vocabulary (e.g., via retrofitting)

  • ✅ can (partly) handle shifts in meaning according to context (contextualized embeddings)

  • ❌ dimensions cannot be interpreted directly: thin theoretical meaning?

  • ❌ reproduce bias in natural language (but debiasing approaches)

  • ❌ proprietary models/data: who can train these?

  • ❌ oftentimes used to explore, rather than for (causal) inference

  • ❌ (procedures for) quality criteria (validity, robustness) not established

The road ahead

Identifying meaning through semantic spaces (embeddings): Overview 📚

  • Gentle introductions: Rodriguez & Spirling (2022); Arseniev-Koehler (2022); Grimmer et al. (2022)

  • Methods: Word2Vec, GloVe, fasttext, A La Carte Embedding, etc.

  • Exemplary studies:

    • features in sentiment analysis: Rudkowsky et al. (2018)
    • identification of stereotypes: Garg et al. (2018), Kroon et al. (2019), Müller et al. (2023)
    • identification of shifts over time: Hamilton et al. (2016), Rodman (2020), Kozlowski et al. (2019)

Identifying meaning through semantic spaces (embeddings): Overview 📚

Any questions? 🤔

References

Antoniak, M., & Mimno, D. (2018). Evaluating the Stability of Embedding-based Word Similarities. Transactions of the Association for Computational Linguistics, 6, 107–119. https://doi.org/10.1162/tacl_a_00008
Arseniev-Koehler, A. (2022). Theoretical Foundations and Limits of Word Embeddings: What Types of Meaning can They Capture? Sociological Methods & Research, 004912412211401. https://doi.org/10.1177/00491241221140142
Bail, C. A. (n.d.). Word Embeddings. https://doi.org/10.1201/9781003093459
Benoit, K., Wang, H., & Watanabe, K. (n.d.). Replication: Word embedding (gloVe/word2vec). https://quanteda.io/articles/pkgdown/replication/text2vec.html
Blodgett, S. L., Barocas, S., Daumé Iii, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of Bias in NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5454–5476. https://doi.org/10.18653/v1/2020.acl-main.485
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. arXiv. http://arxiv.org/abs/1607.04606
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 29). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. https://doi.org/10.1126/science.aal4230
Chan, C.-H. (2023). grafzahl: Fine-tuning Transformers for text data from within R. Computational Communication Research, 5(1), 76. https://doi.org/10.5117/CCR2023.1.003.CHAN
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. http://arxiv.org/abs/1810.04805
Egami, N., Fong, C. J., Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). How to make causal inferences using texts. Science Advances, 8(42), eabg2652. https://doi.org/10.1126/sciadv.abg2652
Feder, A., Keith, K. A., Manzoor, E., Pryzant, R., Sridhar, D., Wood-Doughty, Z., Eisenstein, J., Grimmer, J., Reichart, R., Roberts, M. E., Stewart, B. M., Veitch, V., & Yang, D. (2022). Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond. Transactions of the Association for Computational Linguistics, 10, 1138–1158. https://doi.org/10.1162/tacl_a_00511
Firth, J. R. (1975). Studies in Linguistic Analysis. Wiley-Blackwell.
Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16). https://doi.org/10.1073/pnas.1720347115
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028
Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1489–1501. https://doi.org/10.18653/v1/P16-1141
Hvitfeldt, E., & Silge, J. (2022). Supervised Machine Learning for Text Analysis in R. Accompanying online tutorial, section 5. https://doi.org/10.1201/9781003093459
Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf
Jurriaan, N., & Gils, W. van. (2020). NLP with R part 2: Training Word Embedding models and visualize results. https://medium.com/cmotions/nlp-with-r-part-2-training-word-embedding-models-and-visualize-results-ae444043e234
Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., & Arora, S. (2018). A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors. https://doi.org/10.48550/ARXIV.1805.05388
Kjell, O., Giorgi, S., & Schwartz, H. A. (2023). The text-package: An R-package for analyzing and visualizing human language using natural language processing and transformers. Psychological Methods. https://doi.org/10.1037/met0000542
Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review, 84(5), 905–949. https://doi.org/10.1177/0003122419877135
Kroon, A. C., Trilling, D., Meer, T. G. L. A. van der, & Jonkman, J. G. F. (2019). Clouded reality: News representations of culturally close and distant ethnic outgroups. Communications, 0(0). https://doi.org/10.1515/commun-2019-2069
Le, Q., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. In E. P. Xing & T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning (Vol. 32, pp. 1188–1196). PMLR. https://proceedings.mlr.press/v32/le14.html
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv. http://arxiv.org/abs/1310.4546
Müller, P., Chan, C.-H., Ludwig, K., Freudenthaler, R., & Wessler, H. (2023). Differential Racism in the News: Using Semi-Supervised Machine Learning to Distinguish Explicit and Implicit Stigmatization of Ethnic and Religious Groups in Journalistic Discourse. Political Communication, 40(4), 396–414. https://doi.org/10.1080/10584609.2023.2193146
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://doi.org/10.3115/v1/D14-1162
Roberts, M. E., Stewart, B. M., & Tingley, D. (2016). Navigating the Local Modes of Big Data: The Case of Topic Models. In R. M. Alvarez (Ed.), Computational Social Science (pp. 51–97). Cambridge University Press. https://doi.org/10.1017/CBO9781316257340.004
Rodman, E. (2020). A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors. Political Analysis, 28(1), 87–111. https://doi.org/10.1017/pan.2019.23
Rodriguez, P. L., & Spirling, A. (2022). Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research. The Journal of Politics, 84(1), 101–115. https://doi.org/10.1086/715162
Rodriguez, P. L., Spirling, A., & Stewart, B. M. (2023). Embedding Regression: Models for Context-Specific Description and Inference. American Political Science Review, 1–20. https://doi.org/10.1017/S0003055422001228
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866. https://doi.org/10.1162/tacl_a_00349
Rudkowsky, E., Haselmayer, M., Wastian, M., Jenny, M., Emrich, Š., & Sedlmair, M. (2018). More than Bags of Words: Sentiment Analysis with Word Embeddings. Communication Methods and Measures, 12(2-3), 140–157. https://doi.org/10.1080/19312458.2018.1455817
Schnabel, T., Labutov, I., Mimno, D., & Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 298–307. https://doi.org/10.18653/v1/D15-1036
Schweinberger, M. (2023). Semantic vector space models in R. The University of Queensland, Australia. School of Languages and Cultures.
Song, H., Tolochko, P., Eberl, J.-M., Eisele, O., Greussing, E., Heidenreich, T., Lind, F., Galyga, S., & Boomgaarden, H. G. (2020). In Validations We Trust? The Impact of Imperfect Human Annotations as a Gold Standard on the Quality of Validation of Automated Content Analysis. Political Communication, 37(4), 550–572. https://doi.org/10.1080/10584609.2020.1723752
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/ARXIV.1706.03762
Wendlandt, L., Kummerfeld, J. K., & Mihalcea, R. (2018). Factors Influencing the Surprising Instability of Word Embeddings. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2092–2102. https://doi.org/10.18653/v1/N18-1190
Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., Platen, P. von, Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., … Rush, A. M. (2019). HuggingFace’s Transformers: State-of-the-art Natural Language Processing. https://doi.org/10.48550/ARXIV.1910.03771